Diffuse large B-cell lymphoma (DLBCL) exhibits profound genetic heterogeneity that drives disparate clinical outcomes despite standard immunochemotherapy with rituximab, cyclophosphamide, doxorubicin, vincristine and prednisone (R-CHOP). Although emerging treatments such as bispecific T-cell engagers and chimeric antigen receptor (CAR) T-cell therapy provide additional options, a substantial proportion of patients still develop relapsed or refractory disease. Molecular classifiers including cell-of-origin and the LymphGen algorithm have improved prognostic stratification, yet some patients remain unclassified or lack clear guidance. Moreover, the clinical utility of diagnostic targeted gene sequencing at presentation remains controversial. We therefore hypothesized that applying a targeted lymphoma panel together with machine-learning feature selection and unsupervised clustering would uncover novel high-risk subgroups and refine prognostic stratification beyond existing frameworks.

We retrospectively sequenced formalin-fixed paraffin-embedded tumors from newly diagnosed DLBCL patients (median age 69.0 years [IQR 59.0–75.5], 56% male) using a deep hybrid-capture assay targeting known lymphoma drivers, chromatin modifiers and epigenetic regulators. Sequence reads were aligned to GRCh37 with BWA, variants were called with GATK, and non-synonymous mutations with variant allele frequency ≥10% were annotated with the Ensembl Variant Effect Predictor. A Random Forest regression model with 1,000 trees and ten-fold cross-validation selected the top 30 genes prognostic for progression-free survival (PFS) and overall survival (OS). Principal component analysis on these features followed by k-means clustering (k = 2) defined two molecular clusters. Survival endpoints were assessed by Kaplan–Meier analysis with log-rank testing (two-sided p < 0.05).
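The feature-selection and clustering steps described above can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data, not the study's pipeline. Several details are assumptions because the text does not specify them: the numeric `outcome` used as the regression target (censoring is ignored here), the use of two principal components, ranking by impurity-based feature importance, and the function name `cluster_by_top_genes`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cluster_by_top_genes(mutations, outcome, n_top=30, n_trees=1000, seed=0):
    """Rank genes with a random forest, then cluster patients on the top genes.

    mutations: (n_patients, n_genes) binary matrix, 1 = gene mutated.
    outcome:   (n_patients,) numeric target (hypothetical; censoring ignored).
    """
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    # Ten-fold cross-validated R^2 as a sanity check on the regression model.
    cv_r2 = cross_val_score(forest, mutations, outcome, cv=10)
    forest.fit(mutations, outcome)
    # Column indices of the n_top genes, most important first.
    top_genes = np.argsort(forest.feature_importances_)[::-1][:n_top]
    # Assumption: project onto two principal components before clustering.
    components = PCA(n_components=2).fit_transform(mutations[:, top_genes])
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(components)
    return top_genes, labels, cv_r2


# Synthetic cohort: 120 patients, 40 genes; genes 0 and 1 drive the outcome.
rng = np.random.default_rng(0)
mutations = rng.integers(0, 2, size=(120, 40))
outcome = 2.0 * mutations[:, 0] + mutations[:, 1] + rng.normal(0, 0.1, size=120)

# Fewer trees than the 1,000 in the text, only to keep the demo fast.
top_genes, labels, cv_r2 = cluster_by_top_genes(mutations, outcome, n_trees=100)
print(top_genes[:5], np.bincount(labels), cv_r2.mean())
```

In practice a survival-aware model (for example a random survival forest) would handle censored follow-up more faithfully than plain regression; the sketch uses regression only because that is the model type the text names.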

Cluster 2 was distinguished by a high prevalence of adverse alterations in ATM, TP53, CREBBP, TNFAIP3, CARD11 and TNFRSF14, and by unexpected enrichment of KMT2C and TET2, genes more commonly implicated in other malignancies and not previously highlighted in DLBCL. Feature importance measures showed that established drivers such as B2M, PTEN, NCOR1, TP53 and CREBBP ranked among the top predictors for OS, while novel candidate genes including PCDH20, NFATC1, UBXN11, SMARCA1 and PDGFRB also emerged. For PFS, canonical genes TP53, ATM, TET2 and KMT2C were prioritized alongside less characterized loci such as KLHL6, CIITA and SMARCA1. Clinically, cluster 2 patients had significantly higher Ann Arbor stage and International Prognostic Index (IPI) scores than cluster 1, and tended toward elevated lactate dehydrogenase levels. Well-known prognostic clinical factors thus correlate closely with our gene-based stratification. Kaplan–Meier curves showed significantly inferior PFS in cluster 2 compared with cluster 1 (median PFS 24 months versus not reached, p = 0.02), while OS differences did not reach statistical significance (p = 0.11). Within the MCD LymphGen subtype, normally associated with favorable outcomes, patients in cluster 2 experienced both significantly worse PFS and OS (p < 0.05), underscoring substantial heterogeneity within established subtypes. The prognostic value of our clustering remained significant after adjustment for stage, IPI and lactate dehydrogenase, indicating that it adds prognostic information beyond these known clinical risk factors while remaining concordant with them.
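For readers unfamiliar with a median that is "not reached", the sketch below computes a Kaplan–Meier curve and its median in plain Python. The follow-up times and event flags are invented for illustration and are not the study's data; the function names are likewise hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times:  follow-up in months.
    events: 1 = progression/death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events at time t
        leaving = 0    # everyone leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(times, events):
    """Smallest time at which survival drops to 0.5 or below, else None."""
    for t, s in kaplan_meier(times, events):
        if s <= 0.5:
            return t
    return None  # the curve never falls to 50%: median "not reached"


# Invented example groups (months, event flag).
high_risk = median_survival([6, 12, 18, 24, 30], [1, 1, 1, 0, 0])
low_risk = median_survival([10, 20, 30, 36, 36], [1, 0, 0, 0, 0])
print(high_risk, low_risk)
```

A group's median is reported as not reached when fewer than half of its patients are estimated to have had an event by the end of follow-up, which is why it signals the better-performing arm in a comparison.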

These results indicate that targeted gene sequencing, when combined with machine-learning-driven feature selection and unsupervised clustering, can identify a novel high-risk DLBCL subgroup not captured by current classification systems. The originality of this study lies in the demonstration that machine learning can uncover both canonical and non-canonical genetic drivers and that the resulting cluster retains prognostic value even within existing LymphGen subtypes. Moreover, our high-risk clustering provides a genetic framework that explains previously recognized clinical prognostic factors. This approach highlights previously unrecognized genes of biological and therapeutic relevance, supports early identification of relapse-prone patients and may inform personalized, risk-adapted therapy. Prospective multicenter validation and functional studies of the high-risk cluster are warranted to advance precision medicine in DLBCL.
